
    Run-Time Efficient RNN Compression for Inference on Edge Devices

    Recurrent neural networks can be large and compute-intensive, yet many applications that benefit from RNNs run on small devices with very limited compute and storage capabilities while still having run-time constraints. As a result, there is a need for compression techniques that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper explores a new compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) that achieves this dual objective. This scheme divides the weight matrix into two parts: an unconstrained upper half and a lower half composed of rank-1 blocks. This results in output features where the upper sub-vector has "richer" features while the lower sub-vector has "constrained" features. HMD can compress RNNs by a factor of 2-4x while having a faster run-time than pruning (Zhu & Gupta, 2017) and retaining more model accuracy than matrix factorization (Grachev et al., 2017). We evaluate this technique on 5 benchmarks spanning 3 different applications, illustrating its generality in the domain of edge computing.
    Comment: Published at the 4th edition of the Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications, colocated with the International Symposium on Computer Architecture (ISCA) 2019, Phoenix, Arizona (https://www.emc2-workshop.com/isca-19)
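    As a rough illustration of the scheme the abstract describes, the sketch below (a minimal NumPy version; the block shapes and variable names are assumptions for illustration, not the paper's implementation) computes a matrix-vector product where the upper half of the weight matrix is dense and the lower half is stored only as rank-1 factor vectors:

    ```python
    import numpy as np

    def hmd_matvec(W_upper, u_blocks, v_blocks, x):
        """Hybrid Matrix Decomposition matrix-vector product (sketch).

        The upper half of the weight matrix is dense (unconstrained);
        the lower half is a stack of rank-1 blocks, each stored only as
        its factor vectors u_i and v_i (the block itself is u_i v_i^T).
        """
        upper = W_upper @ x  # dense half: "richer" output features
        # each rank-1 block needs just one dot product: (u_i v_i^T) x = u_i (v_i . x)
        lower = np.concatenate([u * (v @ x) for u, v in zip(u_blocks, v_blocks)])
        return np.concatenate([upper, lower])
    ```

    Storing each lower block as two vectors instead of a full sub-matrix is where the compression comes from, and the per-block cost drops to a single dot product plus a scaling.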

    Design of heterogeneous coherence hierarchies using manager-client pairing

    Over the past ten years, the architecture community has witnessed the end of single-threaded performance scaling and a subsequent shift in focus toward multicore and manycore processing. While this is an exciting time for architects, with many new opportunities and design spaces to explore, it also brings new challenges. One area that is especially impacted is the memory subsystem. Specifically, the design, verification, and evaluation of cache coherence protocols become very challenging as cores grow more numerous and more diverse. This dissertation examines these issues and presents Manager-Client Pairing as a solution to the challenges facing next-generation coherence protocol design. By defining a standardized coherence communication interface and permissions-checking algorithm, Manager-Client Pairing enables coherence hierarchies to be constructed and evaluated quickly, without the high design cost previously associated with hierarchical composition. Further, Manager-Client Pairing also allows for verification composition, even in the presence of protocol heterogeneity. As a result, this rapid development of diverse protocols is ensured to be bug-free, enabling architects to focus on performance optimization, rather than debugging and correctness concerns, while comparing diverse coherence configurations for use in future heterogeneous systems.
    PhD
    Committee Chair: Conte, Thomas; Committee Member: Patt, Yale; Committee Member: Prvulovic, Milos; Committee Member: Ramachandran, Umakishore; Committee Member: Yalamanchili, Sudhaka
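    The standardized interface and permissions check the abstract mentions can be pictured with a toy MSI-style manager/client pair. The class names, the three-level permission lattice, and the `grant` invariant below are illustrative assumptions, not the dissertation's actual interface:

    ```python
    from enum import Enum

    class Perm(Enum):
        # MSI-style access permissions, ordered by strength
        INVALID = 0
        SHARED = 1
        MODIFIED = 2

    class Client:
        """A cache controller's client-side view (sketch)."""
        def __init__(self):
            self.perm = {}  # block address -> Perm

        def check(self, addr, need):
            """Permissions check: an access is legal only if the client's
            current permission is at least as strong as what it needs."""
            return self.perm.get(addr, Perm.INVALID).value >= need.value

    class Manager:
        """A directory that grants permissions to its paired clients (sketch)."""
        def __init__(self, clients):
            self.clients = clients

        def grant(self, client, addr, perm):
            # invariant: at most one MODIFIED holder, with no sharers beside it
            if perm is Perm.MODIFIED:
                for c in self.clients:
                    if c is not client:
                        c.perm[addr] = Perm.INVALID
            client.perm[addr] = perm
    ```

    The point of such a fixed interface is composability: a manager at one level can itself act as a client of the level above, letting heterogeneous protocols stack without redesigning each pairing.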

    PMPT – PERFORMANCE MONITORING PEBS TOOL

    For many applications, a common source of performance degradation is excessive processor stalling caused by high memory latencies or poor data placement. Performance degradations from program and memory hierarchy interactions are often difficult for programmers and compilers to correct due to a lack of run-time information or limited knowledge about the underlying problem. By leveraging the Pentium 4 processor's performance monitoring hardware, specific run-time information can be provided, allowing code modifications that reduce or even eliminate problematic code, resulting in reduced execution times. Furthermore, many tools currently available to aid programmers are program-counter centric. These tools point out which areas of the code produce slowdowns, but they do not directly show where the problem data structures are. This is a common problem in programs that dynamically allocate memory. By creating a "malloc-centric" tool, we can develop an interesting perspective on the memory behavior of the system, providing better insight into the sources of performance problems.
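    The "malloc-centric" idea can be sketched as a wrapper that tags each allocation with its site, so that a memory sample can be charged to the data structure it hit rather than to the program counter that touched it. The tracker below is a hypothetical Python analogue (PEBS itself is hardware-specific), and the site labels are invented for illustration:

    ```python
    import collections

    class AllocTracker:
        """Malloc-centric attribution (sketch): remember which allocation
        site each live buffer came from, so performance samples can be
        mapped back to data structures instead of code addresses."""
        def __init__(self):
            self.site_bytes = collections.Counter()  # site -> total bytes allocated
            self.live = {}                           # buffer id -> (site, size)

        def malloc(self, size, site):
            buf = bytearray(size)  # stand-in for a real heap allocation
            self.live[id(buf)] = (site, size)
            self.site_bytes[site] += size
            return buf

        def attribute_sample(self, buf):
            """Charge a (simulated) memory-latency sample to the buffer's
            allocation site rather than to the faulting instruction."""
            site, _size = self.live[id(buf)]
            return site
    ```

    A real tool would interpose on `malloc`/`free` and look up the sampled data address in a live-range map, but the attribution step is the same: address in, allocation site out.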